APA 2025
Unlike typical testing, where items have a known answer . . .
In forecasting, we do not know the answer at the time of testing
To measure forecasting ability, forecasts are usually scored against the outcome (ground truth)
Using peer comparisons to score forecasts instead of outcomes
Why would this work?
Wisdom of the Crowds: when many predictions are aggregated, individual errors tend to cancel out
These properties may make crowd aggregates a good substitute for the outcome when scoring forecasts in real time
(Atanasov & Himmelstein, 2023; Atanasov et al., 2017; Galton, 1907)
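A minimal sketch in Python (all numbers hypothetical) of the error-cancellation idea: if each forecast is the truth plus independent zero-mean noise, the crowd mean's error shrinks as the crowd grows.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 0.70          # hypothetical true probability of the event
noise_sd = 0.20       # hypothetical spread of individual forecasters' errors

for n in (1, 4, 16, 64, 256, 1024):
    # 10,000 simulated crowds of size n; each forecast = truth + independent zero-mean error
    forecasts = np.clip(truth + rng.normal(0.0, noise_sd, size=(10_000, n)), 0.0, 1.0)
    crowd_mean = forecasts.mean(axis=1)
    print(f"n = {n:4d}   mean |crowd error| = {np.abs(crowd_mean - truth).mean():.4f}")
```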
N = 1,000 forecasters
K = 1,000 items
Generated forecasts for each forecaster on each item
Sampled combinations of N and K
Scored each forecast with both ground-truth and intersubjective measures (sketched in the code below)
Repeated 200 times
Which scoring measure captures forecasters’ skill better?
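A stripped-down sketch of one replication of this design, under assumed generative choices (skill as a latent noise parameter, Brier-style ground-truth scoring, and distance to the leave-one-out crowd mean as the intersubjective score); the actual simulation may have differed.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 1_000, 1_000                       # forecasters and items, as in the poster

skill = rng.uniform(0.05, 0.40, N)        # assumed latent skill: lower noise = better forecaster
truth = rng.uniform(0.0, 1.0, K)          # latent event probabilities
outcome = (rng.uniform(0.0, 1.0, K) < truth).astype(float)
forecasts = np.clip(truth + rng.normal(0.0, skill[:, None], (N, K)), 0.0, 1.0)

# ground-truth scoring: mean Brier score against realized outcomes
brier = ((forecasts - outcome) ** 2).mean(axis=1)

# intersubjective scoring: distance to the leave-one-out crowd mean
crowd = forecasts.mean(axis=0)
loo = (crowd * N - forecasts) / (N - 1)
intersubjective = ((forecasts - loo) ** 2).mean(axis=1)

# which score recovers the latent skill parameter better? (the full design repeats this 200 times)
print(np.corrcoef(skill, brier)[0, 1], np.corrcoef(skill, intersubjective)[0, 1])
```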
Intersubjective scoring correlation increases with N
For N ≥ 16, intersubjective scoring captures the original skill parameter better than ground-truth scoring does
Increased variance affects ground-truth scoring but not intersubjective scoring
Intersubjective scoring correlates with skill more strongly at low N and K combinations than in the baseline simulation
Types of intersubjective measures
Tested for their real-time scoring ability
Surprisingly successful at predicting forecasting accuracy
But they have only been explored in the context of the Probability Elicitation Format (PEF)
(Himmelstein et al., 2023)
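An illustrative contrast between the two measure types, with hypothetical numbers; the exact scoring rules in the cited work may differ.

```python
import numpy as np

probs = np.array([0.90, 0.20, 0.70])   # one forecaster's probabilities on three items (hypothetical)
crowd = np.array([0.80, 0.30, 0.60])   # crowd aggregate on the same items
meta  = np.array([0.70, 0.40, 0.60])   # the forecaster's metapredictions of the crowd's average

# proxy score: squared distance to the crowd aggregate, treated as if it were the outcome
proxy_score = ((probs - crowd) ** 2).mean()

# metaprediction score: how well the forecaster anticipated the crowd's response
metaprediction_score = ((meta - crowd) ** 2).mean()

print(proxy_score, metaprediction_score)   # lower = closer to the crowd
```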
Previously identified skilled forecasters
Aggregating these superforecasters may serve as a better reference criterion than aggregating the general crowd
Proxy scoring with a superforecaster crowd aggregate has been an effective method
(Karger et al., 2021; Tetlock & Gardner, 2015)
Are intersubjective measures still good predictors of forecasting accuracy in the Quantile Elicitation Format (QEF), which offers superior reliability (Zhu et al., 2024)?
Are proxy scores or metapredictions stronger predictors of forecasting accuracy?
Final wave of a longitudinal forecasting study (N = 894)
Forecasts on six items in the QEF
Metapredictions after each item: forecasters predicted the crowd's and the superforecasters' aggregate responses (Wilkening et al., 2022)
Additional sample of N = 42 superforecasters
(Himmelstein et al., 2025)
Scored each forecast with: ground truth, crowd and superforecaster proxy scores, and crowd and superforecaster metaprediction scores (see the sketch below)
Compared these scores to forecasting accuracy on a separate set of twelve items
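One plausible way to score quantile forecasts is the pinball (quantile) loss; this sketch, with hypothetical numbers, shows how the same rule supports both ground-truth and proxy scoring by swapping the scoring value. The study's actual scoring rule may differ.

```python
import numpy as np

def pinball_loss(quantiles, levels, value):
    """Mean quantile (pinball) loss of elicited quantiles against a scoring value."""
    q, tau = np.asarray(quantiles, float), np.asarray(levels, float)
    err = value - q
    return float(np.mean(np.where(err >= 0, tau * err, (tau - 1) * err)))

levels = [0.05, 0.25, 0.50, 0.75, 0.95]   # hypothetical elicited quantile levels
forecast = [10, 18, 25, 31, 45]           # one forecaster's quantiles for a continuous item

print(pinball_loss(forecast, levels, value=28.0))   # ground truth: realized outcome
print(pinball_loss(forecast, levels, value=26.5))   # proxy: e.g., a superforecaster aggregate median
```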
Ground-truth score distributions
Superforecaster aggregate was more accurate than the crowd aggregate more often
Superforecaster metapredictions and proxy scores were the strongest predictors of accuracy
How much variance in forecaster accuracy does each score explain?
Fit a model with random effects for item and person, then conducted a dominance analysis (sketched after the table)
| Score | Contribution | Proportion |
|---|---|---|
| Proxy Super | .13 | .23 |
| Metaprediction Super | .12 | .22 |
| Proxy Crowd | .11 | .20 |
| Metaprediction Crowd | .09 | .17 |
| Ground Truth | .09 | .17 |
| \(R_{forecaster}^2\) | .52 | 1.00 |
(Azen & Budescu, 2003; Budescu, 1993; Luo & Azen, 2013)
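A sketch of general dominance analysis on toy data: each predictor's contribution is its incremental R² averaged over all subsets of the other predictors. For brevity it uses plain OLS rather than the mixed model with random effects used in the poster; all data values are hypothetical.

```python
import itertools
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return 1.0 - np.var(y - X1 @ beta) / np.var(y)

def general_dominance(X, y, names):
    """Average incremental R^2 of each predictor across all subsets of the others."""
    p = X.shape[1]
    result = {}
    for j, name in enumerate(names):
        others = [k for k in range(p) if k != j]
        size_means = []
        for size in range(p):  # average within each subset size, then across sizes
            incs = [
                r_squared(X[:, list(s) + [j]], y) - (r_squared(X[:, list(s)], y) if s else 0.0)
                for s in itertools.combinations(others, size)
            ]
            size_means.append(np.mean(incs))
        result[name] = float(np.mean(size_means))
    return result

# toy data: five correlated scores predicting accuracy (all values hypothetical)
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = X @ np.array([0.50, 0.45, 0.40, 0.30, 0.30]) + rng.normal(size=500)
names = ["proxy_super", "meta_super", "proxy_crowd", "meta_crowd", "ground_truth"]
print(general_dominance(X, y, names))
```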
Intersubjective measures remain effective measures of forecasting ability
Can reduce measurement error by modifying the scoring criterion
Atanasov, P., & Himmelstein, M. (2023). Talent spotting in crowd prediction. In M. Seifert (Ed.), Judgment in Predictive Analytics. Springer.
Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2017). Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls. Management Science, 63(3), 691–706. https://doi.org/10.1287/mnsc.2015.2374
Galton, F. (1907). Vox Populi. Nature, 75(1949), 450–451. https://doi.org/10.1038/075450a0
Himmelstein, M., Budescu, D. V., & Ho, E. H. (2023). The wisdom of many in few: Finding individuals who are as wise as the crowd. Journal of Experimental Psychology: General, 152(5), 1223–1244. https://doi.org/10.1037/xge0001340
Himmelstein, M., Zhu, S. M., Petrov, N., Karger, E., Helmer, J., Livnat, S., Bennett, A., Hedley, P., & Tetlock, P. (2025). The Forecasting Proficiency Test: A General Use Assessment of Forecasting Ability. OSF. https://doi.org/10.31234/osf.io/a7kdx
Karger, E., Monrad, J., Mellers, B., & Tetlock, P. (2021). Reciprocal Scoring: A Method for Forecasting Unanswerable Questions. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3954498
Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
Wilkening, T., Martinie, M., & Howe, P. D. L. (2022). Hidden Experts in the Crowd: Using Meta-Predictions to Leverage Expertise in Single-Question Prediction Problems. Management Science, 68(1), 487–508. https://doi.org/10.1287/mnsc.2020.3919
Zhu, S. M., Budescu, D. V., Petrov, N., Karger, E., & Himmelstein, M. (2024). The psychometric properties of probability and quantile forecasts. Preprint.
Questions?
Contact: jhelmer3@gatech.edu